
Self-Predictive Universal AI

Neural Information Processing Systems

Reinforcement Learning (RL) algorithms typically utilize learning and/or planning techniques to derive effective policies. The integration of both approaches has proven highly successful in addressing complex sequential decision-making challenges, as evidenced by algorithms such as AlphaZero and MuZero, which consolidate the planning process into a parametric search-policy. AIXI, the most potent theoretical universal agent, leverages planning through comprehensive search as its primary means of finding an optimal policy. Here we define an alternative universal agent, which we call Self-AIXI, that, in contrast to AIXI, maximally exploits learning to obtain good policies. It does so by self-predicting its own stream of action data, which is generated, similarly to other TD(0) agents, by taking an action-maximization step over the current on-policy (universal mixture-policy) Q-value estimates. We prove that Self-AIXI converges to AIXI and inherits a series of properties, such as maximal Legg-Hutter intelligence and the self-optimizing property.
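The abstract's core mechanism — generating an action stream by maximizing current on-policy Q-value estimates, updated TD(0)-style — can be illustrated with a minimal tabular sketch. This is not Self-AIXI itself (which operates over a universal mixture of environments); the environment details and hyperparameters here are illustrative assumptions.

```python
# Illustrative tabular sketch of the TD(0)-style loop the abstract
# describes: act by maximizing current Q estimates, then update those
# estimates toward the bootstrapped target. Not Self-AIXI itself.
from collections import defaultdict

class GreedyTD0Agent:
    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)   # Q(s, a) estimates, default 0.0
        self.actions = actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor

    def act(self, state):
        # Action-maximization step over the current Q-value estimates.
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # TD(0) update toward the on-policy (greedy) bootstrap target.
        next_a = self.act(next_state)
        target = reward + self.gamma * self.q[(next_state, next_a)]
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

agent = GreedyTD0Agent(actions=[0, 1])
agent.update(state=0, action=1, reward=1.0, next_state=0)
# After one update, Q(0, 1) = 0.1, so the greedy action in state 0 is 1.
```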


Pixel to policy: DQN Encoders for within & cross-game reinforcement learning

Agrawal, Ashrya, Shah, Priyanshi, Prakash, Sourabh

arXiv.org Artificial Intelligence

Reinforcement Learning can be applied to various tasks and environments. Many of these environments share a similar structure, which can be exploited to improve RL performance on other tasks. Transfer learning takes advantage of this shared structure by learning policies that transfer across different tasks and environments, which can lead to more efficient learning as well as improved performance on a wide range of tasks. This work explores and compares the performance of RL models trained from scratch against several transfer-learning approaches. Additionally, the study examines a model trained on multiple game environments, with the goal of developing a universal game-playing agent, as well as transfer learning with a pre-trained DQN encoder trained on the same game or a different game. Our DQN model achieves a mean episode reward of 46.16, beating human-level performance with merely 20k episodes, significantly fewer than DeepMind's 1M episodes. The mean rewards of 533.42 and 402.17 on the Assault and Space Invaders environments, respectively, represent noteworthy performance on these challenging environments.
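The transfer-learning setup the abstract describes — reusing a DQN's pixel encoder and attaching a fresh Q-value head for a new game — can be sketched as follows. This is a minimal sketch assuming PyTorch; the layer sizes and frame shape (four stacked 84x84 frames, a common Atari preprocessing choice) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (assuming PyTorch) of cross-game transfer with a DQN
# encoder: freeze the pre-trained convolutional encoder and train only a
# new linear Q-value head for the target game's action space.
# Layer sizes are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn

class DQNEncoder(nn.Module):
    """Convolutional encoder mapping stacked frames to a feature vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)

def transfer_dqn(encoder: DQNEncoder, n_actions: int, feat_dim: int):
    # Freeze the pre-trained encoder; only the new head receives gradients.
    for p in encoder.parameters():
        p.requires_grad = False
    head = nn.Linear(feat_dim, n_actions)  # fresh per-game Q-value head
    return nn.Sequential(encoder, head)

encoder = DQNEncoder()  # in practice, loaded from a source-game checkpoint
feat_dim = encoder(torch.zeros(1, 4, 84, 84)).shape[1]
model = transfer_dqn(encoder, n_actions=6, feat_dim=feat_dim)
q_values = model(torch.zeros(1, 4, 84, 84))  # one Q-value per action
```

Freezing the encoder keeps the shared pixel features intact while the head adapts to the new game's action semantics; unfreezing later for fine-tuning is a common variant.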


Adversarial Navigation Mesh Alteration

Hale, David Hunter (University of North Carolina at Charlotte) | Youngblood, G. Michael (University of North Carolina at Charlotte)

AAAI Conferences

Game environments are becoming increasingly mutable through the actions of both players and Non-Player Characters (NPCs). However, current-generation AI agents do not take advantage of the tactical opportunities these mutable worlds provide. We propose a method to make game agents aware of the mutability of the world by extending their repertoire of abilities to include world-alteration commands and evaluation functions that determine when and where to alter the world for the greatest tactical gain. Primarily, our work focuses on the Adversarial Navigation Mesh Alteration (ANMA) algorithm, which evaluates potential changes to the map in adversarial environments from both an attacker and a defender point of view. We present an empirical evaluation of the ANMA algorithm in a Capture The Flag (CTF) simulation environment with several teams of agents. One group of agents (adaptive) lacks the ability to initiate world deformations but can respond and re-plan to take advantage of world modifications. The second team of agents (builders) can only generate additional paths through the world using the attacker portion of ANMA. The third team of agents (universal) can fully deform the world by generating new paths or removing existing paths using both the attacker and defender sections of ANMA. We evaluated these teams and observed that builder agents beat adaptive agents at a rate of 1.33 to 1. The more advanced universal agents beat adaptive agents at a rate of 2.75 to 1 and builder agents at 1.4 to 1.
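The core idea of scoring a candidate map alteration from attacker and defender viewpoints can be sketched on a toy path graph. This is an illustrative sketch, not the paper's ANMA implementation: the graph, the edit representation (toggling an edge), and the scoring function are all assumptions.

```python
# Illustrative sketch (not the paper's ANMA implementation) of evaluating a
# candidate world alteration: a positive score means the edit shortens the
# start-to-goal path (attacker gain); a negative score means it lengthens
# the path (defender gain).
from collections import deque

def shortest_path_len(edges, start, goal):
    """BFS path length over an undirected graph given as a set of edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # goal unreachable

def alteration_score(edges, edit, start, goal):
    """Path-length change from toggling one edge (add if absent, remove
    if present). Positive favors the attacker, negative the defender."""
    before = shortest_path_len(edges, start, goal)
    after = shortest_path_len(edges ^ {edit}, start, goal)
    return before - after

# Toy map: corridor 0-1-2-3. Adding shortcut (0, 3) helps the attacker;
# removing (1, 2) severs the route entirely, helping the defender.
edges = {(0, 1), (1, 2), (2, 3)}
print(alteration_score(edges, (0, 3), start=0, goal=3))  # 3 - 1 = 2
print(alteration_score(edges, (1, 2), start=0, goal=3))  # -inf (path cut)
```

In this framing, the builder agents from the abstract would only consider edge additions, while universal agents search over both additions and removals and pick the edit with the largest gain for their side.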